20 research outputs found
On the enhancement of Big Data Pipelines through Data Preparation, Data Quality, and the distribution of Optimisation Problems
Nowadays, data are fundamental for companies: they provide operational support by facilitating daily transactions, and they have become the cornerstone of strategic decision-making processes in businesses. Numerous techniques exist to extract knowledge and value from data. For example, optimisation algorithms excel at supporting decision-making processes that improve the use of resources, time, and costs in an organisation. In the current industrial context, organisations usually rely on business processes to orchestrate their daily activities while collecting large amounts of information from heterogeneous sources. Given the volume, variety, and velocity of these data, the support of Big Data technologies (which are based on distributed environments) is required. To extract value from the data, a set of techniques or activities is then applied in an orderly way and at different stages. This set of techniques or activities, which facilitates the acquisition, preparation, and analysis of data, is known in the literature as a Big Data pipeline.
In this thesis, the improvement of three stages of Big Data pipelines is tackled: Data Preparation, Data Quality assessment, and Data Analysis. These improvements can be addressed from an individual perspective, focusing on each stage, or from a more complex and global perspective, which implies coordinating these stages to create data workflows.
The first stage to improve is Data Preparation, by supporting the preparation of data with complex structures (i.e., data with several levels of nesting, such as arrays). Shortcomings have been found in the literature and in current technologies when it comes to transforming complex data in a simple way. Therefore, this thesis aims to improve the Data Preparation stage through Domain-Specific Languages (DSLs). Specifically, two DSLs are proposed for different use cases: one is a general-purpose data transformation language, while the other is aimed at extracting event logs in a standard format for process mining algorithms.
The second area for improvement is the assessment of Data Quality. Depending on the type of Data Analysis algorithm, poor-quality data can seriously skew the results. A clear example is optimisation algorithms: if the data are not sufficiently accurate and complete, the search space can be severely affected. Therefore, this thesis formulates a methodology for modelling Data Quality rules adjusted to the context of use, together with a tool that automates their assessment. This makes it possible to discard the data that do not meet the quality criteria defined by the organisation. In addition, the proposal includes a framework that helps to select actions to improve the usability of the data.
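A minimal sketch of the kind of record-level quality check this methodology describes can be written in plain Python; the rule names, fields, and thresholds below are illustrative assumptions, not taken from the thesis:

```python
# Hypothetical record-level data quality assessment: each rule checks one
# quality dimension, and records failing any rule are discarded before analysis.

def assess_record(record, rules):
    """Return the names of the rules the record violates."""
    return [name for name, check in rules.items() if not check(record)]

rules = {
    # Completeness: mandatory fields must be present and non-empty.
    "completeness": lambda r: all(r.get(f) not in (None, "") for f in ("id", "cost")),
    # Accuracy (illustrative range check): cost, when present, must be non-negative.
    "accuracy": lambda r: r.get("cost") is None or r["cost"] >= 0,
}

dataset = [
    {"id": 1, "cost": 10.0},
    {"id": 2, "cost": -5.0},   # fails accuracy
    {"id": 3, "cost": None},   # fails completeness
]

# Keep only the records that satisfy every quality rule.
usable = [r for r in dataset if not assess_record(r, rules)]
```

In the thesis the rules are modelled against the context of use and their evaluation is automated by a tool; this sketch only shows the filtering idea.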
The third and last proposal involves the Data Analysis stage. Here, the thesis faces the challenge of supporting the use of optimisation problems in Big Data pipelines. There is a lack of methodological solutions for computing exhaustive optimisation problems (i.e., those that guarantee finding an optimal solution by exploring the whole search space) in distributed environments. Solving this type of problem in a Big Data context is computationally complex, and can be NP-complete, for two reasons. On the one hand, the search space can grow significantly as the amount of data to be processed by the optimisation algorithms increases; this challenge is addressed through a technique to generate and group problems with distributed data. On the other hand, processing optimisation problems with complex models and large search spaces in distributed environments is not trivial, so a proposal is presented for a particular case of this type of scenario.
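The generate-and-group idea can be illustrated with a toy exhaustive search; this is a sketch under invented data, not the thesis's algorithm, and in a real pipeline each partition would be solved by a separate Spark task:

```python
from itertools import combinations

def solve_exhaustive(items, capacity):
    """Brute-force knapsack-style search over all subsets of one partition:
    exhaustive, so the optimum for that partition is guaranteed."""
    best_value, best_subset = 0, ()
    for k in range(len(items) + 1):
        for subset in combinations(items, k):
            weight = sum(w for w, _ in subset)
            value = sum(v for _, v in subset)
            if weight <= capacity and value > best_value:
                best_value, best_subset = value, subset
    return best_value, best_subset

# (weight, value) pairs grouped as if they lived on different data partitions.
partitions = [
    [(2, 3), (3, 4)],
    [(4, 8), (1, 2)],
]

# Map step: solve each generated subproblem independently.
results = [solve_exhaustive(p, capacity=5) for p in partitions]
# Reduce step: keep the best per-partition result. Note that combining
# per-partition optima is only globally optimal when the problem decomposes
# across partitions -- precisely the issue a grouping technique must handle.
best = max(results, key=lambda r: r[0])
```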
As a result, this thesis develops methodologies that have been published in scientific journals and conferences. The methodologies have been implemented in software tools that are integrated with the Apache Spark data processing engine, and the solutions have been validated through tests and use cases with real datasets.
CHAMALEON: Framework to improve Data Wrangling with Complex Data
Data transformation and schema conciliation are relevant topics in Industry due to the incorporation of data-intensive business processes in organizations. As the number of data sources increases, the complexity of the data increases as well, leading to complex and nested data schemata. Nowadays, novel approaches are being employed in academia and Industry to assist non-expert users in transforming, integrating, and improving the quality of datasets (i.e., data wrangling). However, there is a lack of support for transforming semi-structured complex data. This article presents a state-of-the-art review, identifying and analyzing the most relevant solutions in academia and Industry for transforming this type of data. In addition, we propose a Domain-Specific Language (DSL) to support the transformation of complex data as a first approach to enhance data wrangling processes. We also develop a framework that implements the DSL and evaluate it in a real-world case study.
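The kind of transformation such a DSL would express declaratively can be illustrated in plain Python: flattening a nested record (an array of structs) into flat rows. The record shape and field names are invented for illustration; CHAMALEON's actual DSL syntax is not shown here:

```python
# Illustrative flattening of a nested record: each element of the nested
# array becomes one flat row that also carries the parent's fields.

def flatten(record, array_field):
    """Explode record[array_field] into flat rows joined with the parent fields."""
    parent = {k: v for k, v in record.items() if k != array_field}
    return [{**parent, **child} for child in record[array_field]]

customer = {
    "customer_id": 42,
    "orders": [
        {"order_id": "A", "total": 10.0},
        {"order_id": "B", "total": 25.5},
    ],
}

rows = flatten(customer, "orders")
# Each row now has customer_id, order_id, and total at the top level.
```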
DMN for Data Quality Measurement and Assessment
Data Quality assessment is aimed at evaluating the suitability of a dataset for an intended task. The extensive literature on data quality describes various methodologies for assessing data quality by means of data profiling techniques over whole datasets. Our investigations aim to provide solutions to the need to automatically assess the level of quality of the records of a dataset, where data profiling tools do not provide an adequate level of information. Since it is usually easier to describe when a record has sufficient quality than to calculate a quality indicator, we propose a semi-automatic, business-rule-guided data quality assessment methodology for every record. This involves first listing the business rules that describe the data (data requirements), then those describing how to produce measures (business rules for data quality measurements), and finally those defining how to assess the level of data quality of a dataset (business rules for data quality assessment). The main contribution of this paper is the adoption of the OMG standard DMN (Decision Model and Notation) to support the description of data quality requirements and their automatic assessment using existing DMN engines.
Funding: Ministerio de Ciencia y Tecnología RTI2018-094283-B-C33; Ministerio de Ciencia y Tecnología RTI2018-094283-B-C31; European Regional Development Fund SBPLY/17/180501/00029
Enabling Process Mining in Aircraft Manufactures: Extracting Event Logs and Discovering Processes from Complex Data
Process mining is employed by organizations to fully understand and improve their processes and to detect possible deviations from expected behavior. Process discovery uses event logs as input data, which describe the times of the actions that occur in the traces. Currently, Internet-of-Things environments generate massive, distributed, and not always structured data, which brings about new complex scenarios, since data must first be transformed in order to be handled by process mining tools. This paper shows a successful case of application of a solution that permits the transformation of complex semi-structured data from an aircraft-assembly process in order to create event logs that can be managed by the process mining paradigm. A Domain-Specific Language and a prototype have been implemented to facilitate the extraction of data into the unified traces of an event log. The implementation has been applied within a project in the aeronautic industry, and promising results have been obtained from the log extraction for the discovery of processes and the resulting improvement of the aircraft-assembly process.
Funding: Ministerio de Ciencia y Tecnología RTI2018-094283-B-C3
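The log-extraction step this abstract describes, grouping raw semi-structured events into per-case traces ordered by timestamp, can be sketched in plain Python; the field names ("case", "activity", "ts") and the toy events are illustrative assumptions:

```python
from collections import defaultdict

# Illustrative extraction of an event log from flat event records: events are
# grouped by case identifier and ordered by timestamp, yielding one trace per
# case -- the structure that process discovery algorithms consume.

raw_events = [
    {"case": "aircraft-1", "activity": "drill", "ts": 2},
    {"case": "aircraft-1", "activity": "rivet", "ts": 5},
    {"case": "aircraft-2", "activity": "drill", "ts": 1},
    {"case": "aircraft-1", "activity": "inspect", "ts": 9},
]

def extract_log(events):
    """Build {case_id: [activity, ...]} traces, ordered by timestamp."""
    traces = defaultdict(list)
    for e in sorted(events, key=lambda e: e["ts"]):
        traces[e["case"]].append(e["activity"])
    return dict(traces)

log = extract_log(raw_events)
```

A standard interchange format such as XES adds attributes and metadata on top of this basic case/activity/timestamp structure.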
Enabling Process Mining in Airbus Manufacturing: Extracting Event Logs and Discovering Processes from Complex Data
Funding: Ministerio de Ciencia y Tecnología RTI2018-094283-B-C3
Treatment with tocilizumab or corticosteroids for COVID-19 patients with hyperinflammatory state: a multicentre cohort study (SAM-COVID-19)
Objectives: The objective of this study was to estimate the association between tocilizumab or corticosteroids and the risk of intubation or death in patients with coronavirus disease 19 (COVID-19) with a hyperinflammatory state according to clinical and laboratory parameters.
Methods: A cohort study was performed in 60 Spanish hospitals including 778 patients with COVID-19 and clinical and laboratory data indicative of a hyperinflammatory state. Treatment was mainly with tocilizumab, an intermediate-high dose of corticosteroids (IHDC), a pulse dose of corticosteroids (PDC), combination therapy, or no treatment. Primary outcome was intubation or death; follow-up was 21 days. Propensity score-adjusted estimations using Cox regression (logistic regression if needed) were calculated. Propensity scores were used as confounders, matching variables and for the inverse probability of treatment weights (IPTWs).
Results: In all, 88, 117, 78 and 151 patients treated with tocilizumab, IHDC, PDC, and combination therapy, respectively, were compared with 344 untreated patients. The primary endpoint occurred in 10 (11.4%), 27 (23.1%), 12 (15.4%), 40 (25.6%) and 69 (21.1%), respectively. The IPTW-based hazard ratios (odds ratio for combination therapy) for the primary endpoint were 0.32 (95%CI 0.22-0.47; p < 0.001) for tocilizumab, 0.82 (0.71-1.30; p 0.82) for IHDC, 0.61 (0.43-0.86; p 0.006) for PDC, and 1.17 (0.86-1.58; p 0.30) for combination therapy. Other applications of the propensity score provided similar results, but were not significant for PDC. Tocilizumab was also associated with lower hazard of death alone in IPTW analysis (0.07; 0.02-0.17; p < 0.001).
Conclusions: Tocilizumab might be useful in COVID-19 patients with a hyperinflammatory state and should be prioritized for randomized trials in this situation.
Outcomes from elective colorectal cancer surgery during the SARS-CoV-2 pandemic
This study aimed to describe the change in surgical practice and the impact of SARS-CoV-2 on mortality after surgical resection of colorectal cancer during the initial phases of the SARS-CoV-2 pandemic.
Analysis of Big Data Architectures and Pipelines: Challenges and Opportunities
The continuous technological advances are promoting changes in multiple aspects
of society. One of the consequences of these advances is the increase in the amount of data generated daily. In this scenario, Big Data has emerged as one of
the most disruptive paradigms in recent years, becoming a matter of great interest
for multiple types of organizations. This interest is due to the fact that Big Data
is enabling organizations to extract value from the data they own. At the same
time, Big Data is promoting more technological changes that are increasing the potential
value that can be extracted from data. This value enables companies to increase
and optimize their productive capacity, contributing to increasing their competitive advantages and to easing the decision-making process.
As a result, Big Data has become one of the most studied fields, both in the literature and in Industry. It is constantly evolving and presents significant challenges and opportunities that could increase the quality of the process of value extraction from data. However, since the Big Data paradigm is continually evolving, a detailed and concise study of all aspects related to it is required.
In this work, a study of the state of the art of the Big Data paradigm is carried out. The concepts related to it, the activities and techniques of the value extraction process, and the data processing architectures are studied. Next, the main limitations, challenges, opportunities, and possible research lines related to the Big Data paradigm are identified. Finally, a solution to one of the research challenges that arises in this study is proposed: a framework to deal with the preparation of data with complex structures.
Universidad de Sevilla. Máster en Ingeniería Informática
CC4Spark: Distributing event logs and big complex conformance checking problems
Conformance checking is one of the disciplines that best exposes the power of process mining, since it allows detecting anomalies and deviations in business processes, helping to assess and improve their quality. This is an indispensable task, especially in Big Data environments, where large amounts of data are generated and where the complexity of the processes is increasing. CC4Spark enables companies to face this challenging scenario in two ways. First, it supports distributing conformance checking alignment problems by means of a Big Data infrastructure based on Apache Spark, allowing users to import, transform, and prepare event logs stored in distributed data sources, and to solve the problems in a distributed environment. Second, the tool supports decomposed Petri nets, which helps to noticeably reduce the complexity of the models. Both characteristics help companies face increasingly frequent scenarios with large amounts of logs and highly complex business processes. CC4Spark is not tied to any particular conformance checking algorithm, so users can employ customised algorithms.
Funding: Ministry of Science and Technology of Spain: ECLIPSE (RTI2018-094283-B-C33) project; European Regional Development Fund (ERDF/FEDER); MINECO (TIN2017-86727-C2-1-R); University of Seville: VI Plan Propio de Investigación y Transferencia (VI PPIT-US).
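The fan-out/fan-in pattern behind distributing conformance checking can be illustrated without Spark: partition the traces, replay each one against a model independently (in Spark, this map would run on separate executors), and collect the deviations. The toy model and traces below are invented for illustration; real alignment algorithms over Petri nets are far more involved:

```python
# Illustrative (non-Spark) sketch of distributed conformance checking:
# each partition of the event log is checked against an allowed-transition
# model independently, then the per-trace deviations are collected.

ALLOWED = {("start", "work"), ("work", "review"), ("review", "end")}

def check_trace(trace):
    """Return the consecutive transitions in the trace the model does not allow."""
    return [(a, b) for a, b in zip(trace, trace[1:]) if (a, b) not in ALLOWED]

partitions = [
    [["start", "work", "review", "end"]],   # conforming trace
    [["start", "review", "end"]],           # deviation: skips "work"
]

# Map over partitions and traces; collect deviations keyed by trace.
deviations = {
    tuple(trace): check_trace(trace)
    for part in partitions
    for trace in part
}
```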
When Data Quality Meets DMN
To succeed in their business processes, organizations need data that not only attains suitable levels of quality for the task at hand, but that can also be considered usable for the business. However, many researchers ground the potential usability of the data on its quality. Organizations would benefit from receiving recommendations on the usability of the data before its use. We propose that the recommendation on the usability of the data be supported by a decision process, which includes a context-dependent data-quality assessment based on business rules. Ideally, this recommendation would be generated automatically. Decision Model and Notation (DMN) enables the assessment of data quality based on the evaluation of business rules and also provides stakeholders (e.g., data stewards) with sound support for automating the whole process of generating a recommendation regarding usability based on data quality.
The main contribution of the proposal involves designing and enabling both DMN-driven mechanisms and a guiding methodology (DMN4DQ) to support the automatic generation of a decision-based recommendation on the potential usability of a data record in terms of its level of data quality. Furthermore, the proposal is validated through its application to a real dataset.
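The layered chain the methodology names (data requirements, quality measurements, quality assessment, usability recommendation) can be sketched as a pipeline of plain functions; in the paper each layer is a DMN decision table evaluated by an engine, and the thresholds and field names below are invented for illustration:

```python
# Illustrative chain mirroring DMN4DQ's rule layers. Each function stands in
# for one DMN decision table; thresholds and fields are hypothetical.

def measure(record):
    """Business rules for data quality measurements: one score per dimension."""
    return {
        "completeness": 1.0 if record.get("email") else 0.0,
        "validity": 1.0 if "@" in record.get("email", "") else 0.0,
    }

def assess(measures):
    """Business rules for data quality assessment: aggregate to a level."""
    score = sum(measures.values()) / len(measures)
    return "high" if score == 1.0 else "low" if score == 0.0 else "medium"

def recommend(level):
    """Usability recommendation derived from the assessed quality level."""
    return {"high": "use", "medium": "repair", "low": "discard"}[level]

rec = recommend(assess(measure({"email": "ana@example.org"})))
```

Chaining the layers per record is what lets the recommendation be generated automatically before the data are used.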